Abstract

Existing elastic network models are typically parametrized at a given cutoff distance and often fail to
properly predict the thermal fluctuation of many macromolecules that involve multiple characteristic length scales.

We introduce a multiscale flexibility-rigidity index (mFRI) method to resolve this problem. The proposed mFRI
utilizes two or three correlation kernels parametrized at different length scales to capture protein interactions at
corresponding scales. It is about 20% more accurate than the Gaussian network model (GNM) in the B-factor
prediction of a set of 364 proteins. Additionally, the present method is able to deliver accurate predictions for
multiscale macromolecules that fail GNM. Finally, for a protein of $N$ residues, mFRI is of linear scaling ($O(N)$) in
computational complexity, in contrast to the order of $O(N^3)$ for GNM.

Proteins are among the most essential biomolecules for life. Many protein functions, such as structural
support, catalysis of chemical reactions, and allosteric regulation, are strongly correlated to protein flexibility.
Protein flexibility is an intrinsic property of proteins and can be measured directly or indirectly by many
experimental approaches, such as X-ray crystallography, nuclear magnetic resonance (NMR) and single-molecule
force experiments. Theoretically, protein flexibility can be computed by normal mode analysis (NMA), graph
theory, the rotation translation blocks (RTB) method, and elastic network models, including the
Gaussian network model (GNM) and the anisotropic network model (ANM). A common feature of the above
mentioned time-independent methods is that they resort to the matrix diagonalization procedure. The
computational complexity of the matrix diagonalization is typically of the order of $O(N^3)$, where $N$ is the number
of elements in the matrix. Such a computational complexity calls for new efficient strategies for the flexibility
analysis of large biomolecules.

It is well known that NMA and GNM do not work well for many macromolecules. Park et al. collected three
sets of structures to test the performance of the NMA and GNM methods. It was found that both methods fail to work
and deliver negative correlation coefficients for many structures. The mean correlation coefficients (MCCs) for
the B-factor prediction of small-sized, medium-sized and large-sized sets of structures are about 0.480, 0.482
and 0.494 for NMA, respectively. The GNM performs slightly better, with mean correlation coefficients
of 0.541, 0.550 and 0.529 for the above test sets. There is therefore a pressing need to develop innovative
approaches for biomolecular flexibility analysis.

Recently, we have proposed a few matrix-decomposition-free methods for flexibility analysis, including
molecular nonlinear dynamics, stochastic dynamics and flexibility-rigidity index (FRI). Among them, the
flexibility-rigidity index (FRI) has been introduced to evaluate protein flexibility and rigidity. The fundamental assumptions
of the FRI method are as follows. Protein functions, such as flexibility, rigidity, and energy, are fully determined
by the structure of the protein and its environment, and the protein structure is in turn determined by the relevant
interactions. Therefore, whenever the protein structure is available, there is no need to analyze protein flexibility
and rigidity by tracing back to the protein interaction Hamiltonian. Consequently, the FRI bypasses the $O(N^3)$
matrix diagonalization. Our initial FRI has a computational complexity of $O(N^2)$, and our fast FRI (fFRI),
based on a cell lists algorithm, is of $O(N)$. The FRI and the fFRI have been extensively validated by a set of
365 proteins for parametrization, accuracy and reliability. The parameter free fFRI is about ten percent more
accurate than the GNM on the 365 protein test set and is orders of magnitude faster than GNM on a set of 44
proteins. FRI is able to predict the B-factors of an HIV virus capsid (313 236 residues) in less than 30 seconds on
a single-core processor, a task that would take GNM more than 120 years even if computer memory
were not a limitation. See the supplementary material for details.

Nevertheless, there are structures for which FRI does not work either. In fact, structures that fail
NMA and GNM are likely to be difficult for FRI as well. One such structure is pictured in Figure 2, where the
GNM method fails to predict the high flexibility of a hinge region in calmodulin with any cutoff distance. There
are a number of reasons for this and other types of failure. Crystal environment, solvent type, co-factors, data
collection conditions, and structural refinement procedures are well-known causes.

However, there is one more important cause that, to the best of our knowledge, has not been discussed in the
literature, namely, multiple characteristic length scales in a single protein structure. Indeed, in contrast to small
molecules, macromolecular interactions have a wide variety of characteristic length scales, ranging from
covalent bond, hydrogen bond, van der Waals bond, residue, alpha helix and beta sheet, to domain and protein
scales. Protein flexibility is intrinsically associated with protein interactions, and thus must have a multiscale
trait as well. When the GNM or FRI method is parametrized at a given cutoff or scale parameter, it captures only a
subset of the characteristic length scales and inevitably misses the other characteristic length scales of the protein.
Consequently, neither of them is able to provide an accurate B-factor prediction for such structures.

Multiscale flexibility-rigidity index (mFRI) is constructed to capture the multiscale collective motions of
macromolecules. We utilize multiple correlation kernels, with each kernel being parametrized at a specific scale, to
characterize the multiscale flexibility of macromolecules. The $n$th flexibility index of the $i$th (coarse-grained) particle
is given by

$$ f_i^n = \frac{1}{\sum_{j=1}^{N} w_j^n \Phi^n(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_j^n)}, \tag{1} $$

where $w_j^n$ is an atomic type dependent parameter, $\Phi^n(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_j^n)$ is a correlation kernel and $\eta_j^n$ is a scale
parameter. Here $\mathbf{r}_i$ and $\mathbf{r}_j$ are the coordinates for the $i$th and $j$th particles, respectively. We seek the minimization
of the form

$$ \min_{a^n, b} \left\{ \sum_i \left| \sum_n a^n f_i^n + b - B_i^e \right|^2 \right\}, \tag{2} $$


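where $\{B_i^e\}$ are the experimental B-factors. We use generalized exponential kernels

$$ \Phi^n(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_j^n) = e^{-\left(\|\mathbf{r}_i - \mathbf{r}_j\| / \eta_j^n\right)^{\kappa}}, \quad \kappa > 0 \tag{3} $$

and generalized Lorentz kernels

$$ \Phi^n(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_j^n) = \frac{1}{1 + \left(\|\mathbf{r}_i - \mathbf{r}_j\| / \eta_j^n\right)^{\upsilon}}, \quad \upsilon > 0. \tag{4} $$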


In principle, all parameters can be optimized. For simplicity and computational efficiency, we only determine
$\{a^n\}$ and $b$ in the above minimization process. In this work, we limit the number of kernels to at most three and
set $w_j^n = 1$. Both generalized exponential kernels and generalized Lorentz kernels are employed. A more detailed
description of the mFRI is given in the supplementary material.
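
The procedure described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the synthetic chain and the synthetic B-factors are hypothetical, all weights are set to one, and the sum over $j$ is taken to include $j = i$ as the summation limits suggest. The kernel parameters in the demo follow the three-kernel choice reported for the case studies.

```python
import numpy as np

def exponential_kernel(d, eta, kappa=1.0):
    """Generalized exponential kernel, exp(-(d / eta)**kappa)."""
    return np.exp(-(d / eta) ** kappa)

def lorentz_kernel(d, eta, upsilon=3.0):
    """Generalized Lorentz kernel, 1 / (1 + (d / eta)**upsilon)."""
    return 1.0 / (1.0 + (d / eta) ** upsilon)

def flexibility_indices(coords, kernels):
    """Return an (N, K) array; column n holds f_i^n with all weights set to 1."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Assumption: the sum runs over all j, including j == i (sum from j = 1 to N).
    return np.column_stack([1.0 / kernel(dist).sum(axis=1) for kernel in kernels])

def fit_mfri(coords, b_exp, kernels):
    """Least-squares fit of {a^n} and b; returns (parameters, predicted B-factors)."""
    flex = flexibility_indices(coords, kernels)
    design = np.column_stack([flex, np.ones(len(coords))])  # last column fits b
    params, *_ = np.linalg.lstsq(design, b_exp, rcond=None)
    return params, design @ params

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-in for C-alpha coordinates: a random chain with 3.8 A steps.
    steps = rng.normal(size=(80, 3))
    steps *= 3.8 / np.linalg.norm(steps, axis=1, keepdims=True)
    coords = np.cumsum(steps, axis=0)
    b_exp = rng.uniform(10.0, 40.0, size=80)  # synthetic "experimental" B-factors
    kernels = [
        lambda d: lorentz_kernel(d, eta=3.0),
        lambda d: lorentz_kernel(d, eta=7.0),
        lambda d: exponential_kernel(d, eta=15.0),
    ]
    params, b_pred = fit_mfri(coords, b_exp, kernels)
    print("a^1, a^2, a^3, b =", params)
```

Because only $\{a^n\}$ and $b$ are fitted, the problem stays linear and a single least-squares solve suffices; optimizing the scales as well would turn it into a nonlinear fit.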
Figure 1: An illustration of multiscale behavior in protein flexibility analysis. Two Lorentz kernels ($\upsilon = 3$) are used. Their scale ($\eta$)
values are listed along the horizontal and vertical axes. The mean correlation coefficient value for B-factor prediction on a set of 364 proteins
is shown in each cell of the matrix and color coded for convenience, with red representing the highest correlation coefficients and green the
lowest. The combination of a relatively small-scale kernel and a relatively large-scale kernel delivers the best prediction, which indicates
the importance of incorporating multiple scales in protein flexibility analysis.

To understand the multiscale behavior of flexibility analysis, we consider a test set containing 364 protein
structures whose Protein Data Bank (PDB) IDs are listed in the literature and which contains the test sets used
in GNM studies. This test set omits one structure present in previous FRI studies (PDB ID: 1AGN) due to
unrealistic B-factor data. Our goal is to examine how an additional kernel with a large length scale impacts the
flexibility analysis. To this end, we consider two smooth Lorentz type kernels with $\upsilon = 3$. We explore a number
of scale combinations as shown in Fig. 1, which plots the MCC values for B-factor prediction on the set of 364
structures. The low MCC values on the diagonal indicate that two-scale methods are always better than
a single-scale one. The best results are achieved with the combination of a relatively small-scale kernel and a
relatively large-scale kernel. This behavior demonstrates the importance of incorporating multiple scales in biomolecular
flexibility analysis. The best MCC for the test set is 0.67, which is about 20% better than the best GNM prediction
and about a 6% improvement over our single-scale FRI approach.

The improvement in the MCC for B-factor prediction on a set of 364 proteins discussed above obscures the
fact that the proposed multiscale method is able to capture the multiscale behavior of many structures that fail
the original FRI and GNM. In the rest of this paper, we demonstrate the utility of the proposed multiscale method with
a few case studies. A three-scale FRI is employed.

Protein hinge regions have been shown to be correlated with active sites and catalysis in enzymes. Flexibility 
has a major role in specificity of binding of a protein to other proteins, nucleic acids or other molecules. An 
active site or docking region that is more flexible will accommodate more varied substrates or partners while 
more rigid domains are more specific. Protein hinges are also found separating large domains of proteins. In 
this context, the hinges can be very important for protein conformational changes. The protein featured in this 
section, calmodulin, is a good example of a hinge that affects both structure and function. 

The central region of calmodulin shown in Figure 2 is a long α-helix which is unwound or kinked at the middle
when no calcium is bound to the two distal metal-coordinating domains. In both forms, with or without calcium
bound, this helix retains a large degree of flexibility based on B-factor values from the PDB files (1CLL and
1CFD).

Figure 2: Top, the structure of calmodulin (PDB ID: 1CLL) visualized in VMD and colored by experimental B-factors (left), mFRI predicted
B-factors (middle) and GNM predicted B-factors (right), with red representing the most flexible regions. Bottom, the experimental and predicted
B-factor values plotted per residue. GNM7 denotes the GNM method with a cutoff distance of 7 Å. Clearly, GNM misses the flexible hinge
region. The mFRI is parametrized with two Lorentz kernels ($\upsilon = 3$, $\eta = 3$ Å and $\upsilon = 3$, $\eta = 7$ Å) and one exponential kernel ($\kappa = 1$, $\eta = 15$ Å).

Many tools exist for the prediction and analysis of hinges in proteins using bioinformatics, graph theory
and energetics. The proposed mFRI has capabilities similar to those of these tools. The mFRI can be used to
predict hinge regions by locating regions of high FRI values or predicted B-factors.

A comparison of the mFRI method and GNM for the B-factor prediction of calcium-bound calmodulin is displayed
in Figure 2. Neither single-kernel FRI nor GNM is able to accurately predict the B-factors of the hinge region in
the middle of the protein with any parameter. Two- and three-kernel based mFRI methods, on the other hand,
are much more accurate in the hinge region. The accuracy grows as more kernels are added, but
sufficient accuracy is already achieved with three kernels.

We have shown in our supplementary material that a similarly good B-factor prediction for calmodulin type
structures can be achieved by the original FRI method if the crystal effect is taken into consideration. This result
suggests that the proposed mFRI method may be able to account for some crystal effects.
Figure 3: Top, a visual comparison of experimental B-factors (left), mFRI predicted B-factors (middle) and GNM predicted B-factors (right)
for the engineered teal fluorescent protein, mTFP1 (PDB ID: 2HQK). Bottom, the experimental and predicted B-factor values plotted per
residue. The GNM naming convention indicates the cutoff used for the GNM method in Å, i.e., GNM7 is the GNM method with a cutoff of
7 Å, etc.

Cyan fluorescent protein (CFP), shown in Figure 3, is a homolog of the famous green fluorescent protein
(GFP). Isolated from the crystal jellyfish in the 1990s, GFP enabled a revolution in biochemistry by allowing the
tagging and tracking of a wide range of molecules. CFP was found later in Anthozoa species, which have turned
out to be a good source of fluorescent proteins with varied emission spectra. In this example we examine the
flexibility of an engineered CFP, mTFP1 (PDB ID: 2HQK). It is clear in Figure 3 that GNM B-factor predictions
contain a large error around residues 50-60, which is very pronounced at the recommended cutoff of 7 Å and is
still somewhat problematic when the cutoff is changed to 8 Å. mFRI, on the other hand, has no issue with this
particular region. Upon further inspection, it is clear that the offending region is the small, alpha-helical region
suspended in the center of the beta-barrel. It is not surprising that this sort of configuration would be highly
cutoff dependent in a scheme such as GNM, which has hard cutoffs for connectivity. It would appear that this
structure is dominated by short-range interactions, but the region of residues 50-60 is affected to a large degree
by mid-range interactions, i.e., there are at least two important scales of interaction in this case. It follows then
that mFRI, which has kernels to capture short- and mid-range interactions, would perform better than the GNM7 or
GNM8 methods alone in B-factor predictions, which is exactly what we see from the results in Figure 3.
Figure 4: Top, a visual comparison of experimental B-factors (left), mFRI predicted B-factors (middle) and GNM predicted B-factors (right) for
the probable antibiotic synthesis protein (PDB ID: 1V70). Bottom, the experimental and predicted B-factor values plotted per residue.

A similar situation exists with the structure 1V70, a probable antibiotic synthesis protein, which is shown in
Figure 4. As in the last example, the problematic portion for B-factor prediction comes at the end of a protein
chain. In this case there is an overestimation of flexibility for residues 1-10 when using GNM. Again, varying
the cutoff from the recommended 7 Å yields marginally better results; however, no parametrization is able to
reach the accuracy of mFRI.
Figure 5: Top, a visual comparison of experimental B-factors (left), mFRI predicted B-factors (middle) and GNM predicted B-factors (right) for
the ribosomal protein L14 (PDB ID: 1WHI). Bottom, the experimental and predicted B-factor values plotted per residue.

The final example is a biologically important molecule, ribosomal protein L14, a component of the 60S
ribosomal subunit. Depicted in Figure 5, L14 is a structurally diverse protein containing regions of alpha helix,
beta-barrel, parallel beta strands and a beta-hairpin motif. The pattern of flexibility predicted by GNM for this
structure is exaggerated, i.e., rigid areas are predicted to be more rigid than they actually are
and vice versa. This pattern exists in most GNM results due to the use of a hard cutoff in the Kirchhoff matrix.
Such a hard cutoff will inevitably lead to the overestimation of bond importance near the edge of the cutoff.
Therefore, if a large number of interactions exist for a particular atom near the cutoff point, there is likely to be a
large error in the estimation of flexibility for that atom. This is likely what is happening with the errors in the GNM
calculations for the proteins in Figures 3, 4 and 5: the region at the end of the chain may be near the edge of the
cutoff distance for many interactions with the bulk of the protein. While adjusting GNM's cutoff distance may
temper the error being introduced, it cannot eliminate it completely unless one switches to a soft-decaying kernel
method such as FRI. Nevertheless, soft-decaying kernel based methods can only alleviate the problem. They
do not deliver satisfactory B-factor predictions unless a multiscale strategy is employed. We note that it is not
obvious how to incorporate a multiscale strategy in matrix diagonalization based methods.
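
The sensitivity of a hard cutoff to small changes in distance can be seen in a short numerical sketch. The helper names and the two sample distances are hypothetical; the 7 Å cutoff and the Lorentz kernel with $\upsilon = 3$ are the values used in the text.

```python
import numpy as np

def hard_cutoff_weight(d, cutoff=7.0):
    """GNM-style connectivity: 1 inside the cutoff, 0 outside."""
    return (np.asarray(d, dtype=float) <= cutoff).astype(float)

def lorentz_weight(d, eta=7.0, upsilon=3.0):
    """Soft-decaying Lorentz kernel, 1 / (1 + (d / eta)**upsilon)."""
    return 1.0 / (1.0 + (np.asarray(d, dtype=float) / eta) ** upsilon)

# Two hypothetical neighbor distances that straddle the cutoff by 0.1 A each.
d = np.array([6.9, 7.1])
hard = hard_cutoff_weight(d)
soft = lorentz_weight(d)

# The hard cutoff switches the whole contact on or off,
# while the soft kernel changes its weight only slightly.
print("hard cutoff weights:", hard)
print("Lorentz weights:    ", soft)
```

A displacement of 0.2 Å across the cutoff flips the hard-cutoff contact entirely, whereas the Lorentz weight changes by only a few percent. An atom with many neighbors near the cutoff is therefore much more exposed to this effect in a Kirchhoff matrix than in a kernel sum.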
Supplementary Material


Flexibility-rigidity index 

In FRI, the topological connectivity of a biomolecule is measured by the rigidity index and the flexibility index. In
particular, the rigidity index represents the protein density profile. Consider an $N$-atom representation of a biomolecule.
The coordinates of these atoms are given as $\{\mathbf{r}_j \,|\, \mathbf{r}_j \in \mathbb{R}^3, j = 1, 2, \cdots, N\}$. We denote by $\|\mathbf{r}_i - \mathbf{r}_j\|$ the Euclidean
distance between the $i$th atom and the $j$th atom. A general correlation kernel, $\Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta_j)$, is a
real-valued monotonically decreasing function satisfying

$$ \Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta_j) = 1 \quad \text{as } \|\mathbf{r} - \mathbf{r}_j\| \to 0, \tag{5} $$

$$ \Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta_j) = 0 \quad \text{as } \|\mathbf{r} - \mathbf{r}_j\| \to \infty, \tag{6} $$

where $\eta_j$ is an atomic type dependent scale parameter. The correlation between the $i$th and $j$th particles is given
by

$$ C_{ij} = \Phi(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_j). \tag{7} $$

The correlation matrix $\{C_{ij}\}$ can be computed to visualize the connectivity among protein particles.
We define a position ($\mathbf{r}$) dependent rigidity function or density function

$$ \mu(\mathbf{r}) = \sum_{j=1}^{N} w_j \Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta_j), \tag{8} $$


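where $w_j$ is an atom type dependent weight. For example, carbon, nitrogen and phosphorus atoms can have
different weights. Although delta sequences of the positive type discussed in an earlier work are all good
choices, generalized exponential functions

$$ \Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta_j) = e^{-\left(\|\mathbf{r} - \mathbf{r}_j\| / \eta_j\right)^{\kappa}}, \quad \kappa > 0 \tag{9} $$

and generalized Lorentz functions

$$ \Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta_j) = \frac{1}{1 + \left(\|\mathbf{r} - \mathbf{r}_j\| / \eta_j\right)^{\upsilon}}, \quad \upsilon > 0 \tag{10} $$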


have been commonly used in our recent work. Since the rigidity function can be directly interpreted as a
density function, it can be used to define the rigidity surface of a biomolecule by taking an isovalue. Setting
$\kappa = 2$ in Eq. (9) yields the formula for a Gaussian surface.

Similarly, we define a position ($\mathbf{r}$) dependent flexibility function

$$ F(\mathbf{r}) = \frac{1}{\sum_{j=1}^{N} w_j \Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta_j)}. \tag{11} $$


This function is well defined in the computational domain containing the biomolecule. The flexibility function can 
be visualized by its projection on a given surface, such as the solvent excluded surface of a biomolecule. 
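
As a rough illustration of these position-dependent functions, the sketch below evaluates the rigidity (density) function with a Gaussian kernel ($\kappa = 2$) and its reciprocal, the flexibility function, at arbitrary points. The function names, the default scale of 3 Å and the unit weights are assumptions made for the example only.

```python
import numpy as np

def rigidity_function(points, atoms, eta=3.0, kappa=2.0, weights=None):
    """Sum over atoms of w_j * exp(-(|r - r_j| / eta)**kappa) at each point r."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    atoms = np.asarray(atoms, dtype=float)
    w = np.ones(len(atoms)) if weights is None else np.asarray(weights, dtype=float)
    # dist[p, j] is the distance from evaluation point p to atom j.
    dist = np.linalg.norm(points[:, None, :] - atoms[None, :, :], axis=-1)
    return (w * np.exp(-(dist / eta) ** kappa)).sum(axis=1)

def flexibility_function(points, atoms, **kernel_args):
    """Reciprocal of the rigidity function; meaningful only near the molecule."""
    return 1.0 / rigidity_function(points, atoms, **kernel_args)

if __name__ == "__main__":
    # Two hypothetical atoms 4 A apart, probed at the midpoint and at one atom.
    atoms = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
    probes = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
    print("rigidity:   ", rigidity_function(probes, atoms))
    print("flexibility:", flexibility_function(probes, atoms))
```

Evaluating `rigidity_function` on a regular grid and extracting an isovalue would give a rigidity surface of the kind mentioned in the text. Far from every atom the rigidity tends to zero, so the flexibility function should only be evaluated inside the computational domain containing the biomolecule.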

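The rigidity index for the $i$th particle is defined as

$$ \mu_i = \sum_{j=1}^{N} w_j \Phi(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_j). \tag{12} $$

Here $\mu_i$ measures the total density or rigidity at the $i$th particle. In a similar manner, we define a set of flexibility
indices by

$$ f_i = \frac{1}{\sum_{j=1}^{N} w_j \Phi(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_j)}. \tag{13} $$

The flexibility index $f_i$ is directly associated with the B-factor of the $i$th particle

$$ B_i^t = a f_i + b, \quad \forall i = 1, 2, \cdots, N, \tag{14} $$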

where $\{B_i^t\}$ are theoretically predicted B-factors, and $a$ and $b$ are two constants to be determined by a simple
linear regression. This allows us to use experimental data to validate the FRI method. In our earlier work,
we set $w_j = 1$ for the coarse-grained Cα representation of proteins. We have also developed parameter free FRI
(pfFRI), such as ($\kappa = 1$, $\eta = 3$) and ($\upsilon = 3$, $\eta = 3$), to make our FRI robust for protein Cα B-factor prediction.





Multiscale FRI 

The basic idea of multiscale FRI (mFRI) is quite simple. Since macromolecules are inherently multiscale in
nature, we utilize multiple correlation kernels that are parametrized at multiple scales to characterize the multiscale
flexibility of macromolecules

$$ f_i^n = \frac{1}{\sum_{j=1}^{N} w_j^n \Phi^n(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_j^n)}, \tag{15} $$

where $w_j^n$, $\Phi^n(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_j^n)$ and $\eta_j^n$ are the corresponding quantities associated with the $n$th kernel. We seek
the minimization of the form

$$ \min_{a^n, b} \left\{ \sum_i \left| \sum_n a^n f_i^n + b - B_i^e \right|^2 \right\}, \tag{16} $$

where $\{B_i^e\}$ are the experimental B-factors. In principle, all parameters can be optimized. For simplicity and
computational efficiency, we only determine $\{a^n\}$ and $b$ in the above minimization process. For each kernel,
$w_j^n$ and $\eta_j^n$ will be selected according to the type of particles.
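
Since only $\{a^n\}$ and $b$ are determined, the minimization is an ordinary linear least-squares problem. As a brief worked form (the symbols $K$ for the number of kernels, $F$ for the design matrix and $\mathbf{c}$ for the parameter vector are introduced here for illustration and are not part of the original notation):

```latex
\begin{aligned}
F &= \begin{pmatrix}
       f_1^1 & \cdots & f_1^K & 1 \\
       \vdots &       & \vdots & \vdots \\
       f_N^1 & \cdots & f_N^K & 1
     \end{pmatrix}, \qquad
\mathbf{c} = \left(a^1, \dots, a^K, b\right)^{T}, \qquad
\mathbf{B}^e = \left(B_1^e, \dots, B_N^e\right)^{T}, \\[4pt]
\min_{\mathbf{c}} \left\| F\mathbf{c} - \mathbf{B}^e \right\|_2^2
  \quad &\Longrightarrow \quad
F^{T} F \, \mathbf{c} = F^{T} \mathbf{B}^e .
\end{aligned}
```

The normal equations involve only a $(K+1) \times (K+1)$ system, so for two or three kernels the cost of the fit itself is negligible next to the evaluation of the flexibility indices.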

Specifically, for a simple Cα network, we can set $w_j^n = 1$ and choose a single kernel function parametrized at
different scales. The predicted B-factors can be expressed as

$$ B_i^{\mathrm{mFRI}} = b + \sum_{n} \frac{a^n}{\sum_{j=1}^{N} \Phi(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta^n)}. \tag{17} $$

The difference between Eqs. (15) and (17) is that, in Eq. (15), both the kernel and the scale can be changed
for different $n$. In contrast, in Eq. (17), only the scale is changed. One can use a given kernel, such as

$$ \Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta^n) = \frac{1}{1 + \left(\|\mathbf{r} - \mathbf{r}_j\| / \eta^n\right)^{3}}, \tag{18} $$

to achieve good multiscale predictions.

Result evaluation 

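To quantitatively assess the performance of the proposed multikernel based mFRI method, we consider the
correlation coefficient (CC)

$$ \mathrm{CC} = \frac{\sum_{i=1}^{N} \left(B_i^e - \bar{B}^e\right)\left(B_i^t - \bar{B}^t\right)}{\left[\sum_{i=1}^{N} \left(B_i^e - \bar{B}^e\right)^2 \sum_{i=1}^{N} \left(B_i^t - \bar{B}^t\right)^2\right]^{1/2}}, \tag{19} $$

where $\{B_i^t, i = 1, 2, \cdots, N\}$ are a set of B-factors predicted by the proposed method and $\{B_i^e, i =
1, 2, \cdots, N\}$ are a set of experimental B-factors extracted from the PDB file. Here $\bar{B}^t$ and $\bar{B}^e$ are the statistical
averages of theoretical and experimental B-factors, respectively.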
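
For concreteness, the correlation coefficient and its average over a test set (the MCC quoted throughout) can be computed as in the following sketch. The function names are hypothetical, and the sketch assumes the standard Pearson form of the correlation coefficient.

```python
import numpy as np

def correlation_coefficient(b_pred, b_exp):
    """Pearson correlation between predicted and experimental B-factors."""
    b_pred = np.asarray(b_pred, dtype=float)
    b_exp = np.asarray(b_exp, dtype=float)
    dev_pred = b_pred - b_pred.mean()
    dev_exp = b_exp - b_exp.mean()
    norm = np.sqrt((dev_pred ** 2).sum() * (dev_exp ** 2).sum())
    return float((dev_pred * dev_exp).sum() / norm)

def mean_correlation_coefficient(pairs):
    """Average CC over an iterable of (predicted, experimental) pairs, one per protein."""
    return float(np.mean([correlation_coefficient(p, e) for p, e in pairs]))
```

Because the CC is invariant under the linear map $B \mapsto aB + b$ with $a > 0$, the regression constants affect the predicted values but not the reported correlation for a single-kernel model.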
Figure 6: Computational efficiency of multiscale fast FRI (multi fFRI) relative to single kernel fast FRI (fFRI) and GNM. The data sets used for
the present efficiency study are the same as those listed in Table VIII of our earlier work. The largest test molecule on the right is an HIV virus capsid,
which has 313 236 amino acid residues and takes about 30 s for the mFRI method to compute its B-factors.


Parameterization in case studies 


N          η² = 9 Å   η² = 12 Å   η² = 15 Å   η² = 17 Å   η² = 20 Å   η² = 25 Å
0-99       0.055      0.083       0.100       0.102       0.097       0.083
100-199    0.061      0.093       0.101       0.100       0.099       0.093
200-299    0.051      0.087       0.097       0.097       0.095       0.087
300-399    0.069      0.108       0.115       0.119       0.123       0.108
400-499    0.079      0.126       0.148       0.157       0.155       0.126
500+       0.064      0.107       0.136       0.143       0.140       0.107
Overall    0.060      0.094       0.106       0.108       0.106       0.094

Table 1: Improvements in mean correlation coefficients (MCC) for the B-factor prediction of a set of 364 proteins due to the introduction of
an additional kernel parametrized at a large scale (η²). Two exponential kernels with κ = 25 are employed. The first kernel's scale value is
set to η¹ = 7.0 Å in all cases. The second kernel's scale value (η²) is varied and listed at the top of the table. Results are organized and
split by the size of the structures, based on the number of amino acids (N), in order to show the impact of different η² values on different sizes of
proteins.

In all of our case studies, we have used both mFRI and GNM to predict B-factors. When GNM performed
poorly, different parameters were tried to see if there is a more suitable parametrization. The results of B-factor
prediction are mapped onto the residues for visual comparison and plotted against the experimental
values for more detail. The mFRI method used in our case studies combines three kernels. After some testing
we decided to use one kernel of exponential decay ($\kappa = 1$) and two kernels of Lorentz type ($\upsilon = 3$)
with different scale ($\eta$) parameter values. The choice of kernels and parameters is driven by the idea that each
kernel should capture interactions of different ranges, e.g., short-, medium- and long-range interactions each
being represented by a different kernel. The exponential kernel is chosen to represent the slowest decaying
forces, with $\eta = 15$ Å and $\kappa = 1$, while the two Lorentz type kernels capture relatively short- and medium-range
interactions with parameters $\upsilon = 3$, $\eta = 3.0$ Å and $\eta = 7$ Å, respectively. The associated MCC for the 364 protein test
set is 0.689, which is about 22% better than that obtained by using the GNM method. Other combinations of
kernel parameters were tried in which the exponential kernel exhibited the quickest decay; however, they did not
perform as well in B-factor prediction tests. The fast decaying Lorentz kernel, with $\eta = 3$ Å and $\upsilon = 3$, may be well
suited to capture the effect of chemical bonds due to its particular shape of decay, which highly favors interactions
below 3.0 Å.

More detailed analysis of macromolecular multiscale behavior 

To further explore the importance of multiscale methods, we use kernels having a sharp decay behavior similar
to that of a Heaviside step function. This is achieved by setting $\kappa = 25$ for the exponential type of correlation
kernels. In this case, the one-kernel FRI method behaves like the GNM method. The best performance for
one-kernel FRI is obtained at $\eta^1 = 7$ Å and the associated mean correlation coefficient (MCC) for the 364 protein test set
is 0.540, which is similar to that obtained by using GNM. The cutoff type of kernel behavior obtained
at $\kappa = 25$ does not recognize any large-scale correlation beyond 7 Å in macromolecules. To capture large-scale
correlations, we employ a second exponential kernel with its scale ($\eta^2 > \eta^1$) varying over a range of values as
shown in Table 1.
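
The step-like behavior at $\kappa = 25$ is easy to check numerically. The sketch below compares the generalized exponential kernel at $\kappa = 25$ and $\kappa = 1$ for a few sample distances around the 7 Å scale; the function name and the sample distances are chosen for illustration only.

```python
import numpy as np

def generalized_exponential(d, eta, kappa):
    """Generalized exponential kernel, exp(-(d / eta)**kappa)."""
    return np.exp(-(np.asarray(d, dtype=float) / eta) ** kappa)

# Sample distances (in A) on both sides of the scale eta = 7 A.
d = np.array([5.0, 6.5, 7.5, 9.0])
sharp = generalized_exponential(d, eta=7.0, kappa=25)   # close to a step at 7 A
smooth = generalized_exponential(d, eta=7.0, kappa=1)   # slow exponential decay

for dist, s, m in zip(d, sharp, smooth):
    print(f"d = {dist:4.1f} A   kappa=25: {s:.4f}   kappa=1: {m:.4f}")
```

With $\kappa = 25$ the kernel stays near one below the scale and drops to nearly zero just above it, which is why the one-kernel FRI then behaves like a cutoff-based method; with $\kappa = 1$ a contact at 9 Å still carries appreciable weight.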

To analyze the scale behavior due to protein size, we classify the 364 proteins into 6 groups. The improvements
of MCC due to the introduction of an additional kernel are listed in Table 1 for a number of large scale values $\eta^2$.
First, the B-factor predictions for all size classes benefit significantly from the introduction of the
large-scale kernel. Additionally, at the scale value of $\eta^2 = 17$ Å, the MCC is 0.648 and the associated improvement over
the original FRI or GNM methods is 20% for the set of 364 proteins. Note that this multiscale improvement cannot
be easily achieved by GNM, NMA, or any other mode decomposition based methods. Moreover, the
large-scale kernel leads to the most significant improvement in the B-factor prediction for relatively large proteins, i.e.,
proteins with 400-499 residues, which indicates that large proteins have more significant multiscale correlations
than small proteins do. Finally, the improvement in the B-factor prediction for proteins with more than 500
residues is not as large as that for proteins with 400-499 residues, which indicates that two scales are not
enough to capture all the multiscale correlations in proteins with more than 500 residues. This observation
suggests that three or more scales are needed for the B-factor prediction of very large proteins.

Computational complexity

It has previously been demonstrated that the computational complexity of the single kernel FRI method is
asymptotically of $O(N^2)$. By making use of the cell lists algorithm, fFRI achieves a computational complexity of
$O(N)$. The addition of multiple kernels to the FRI method does not affect this aspect of scaling; however, the
running time for B-factor prediction does increase slightly with each additional kernel. Indeed, the multi-kernel
regression requires the optimization of one more parameter with the addition of each new kernel. The impact of these
changes on the running time of FRI based B-factor prediction is shown in Figure 6. We employ the same data
sets and test conditions as those described in our earlier paper for the present test. The data used for testing
mFRI and fFRI are the same as those used in testing the fFRI in Table VIII of that paper. In testing the GNM, the
same data set as that listed in Table VIII of that paper is employed.
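
The idea behind the cell lists approach can be sketched as follows. This is a simplified illustration rather than the fFRI implementation: it assumes the kernel is truncated at a cutoff radius `r_cut` (a hypothetical parameter, as are the function name and the default values), bins atoms into cubic cells of that size, and sums kernel contributions only over the 27 surrounding cells.

```python
import numpy as np
from collections import defaultdict
from itertools import product

def rigidity_cell_list(coords, eta=3.0, upsilon=3.0, r_cut=12.0):
    """Rigidity index with a Lorentz kernel truncated at r_cut, using a cell list."""
    coords = np.asarray(coords, dtype=float)
    # Assign every atom to a cubic cell of edge length r_cut.
    keys = [tuple(int(c) for c in row) for row in np.floor(coords / r_cut)]
    cells = defaultdict(list)
    for idx, key in enumerate(keys):
        cells[key].append(idx)

    offsets = list(product((-1, 0, 1), repeat=3))
    mu = np.zeros(len(coords))
    for i, key in enumerate(keys):
        # Any atom within r_cut of atom i lies in one of the 27 neighboring cells.
        for off in offsets:
            neighbor = (key[0] + off[0], key[1] + off[1], key[2] + off[2])
            for j in cells.get(neighbor, ()):
                dist = np.linalg.norm(coords[i] - coords[j])
                if dist <= r_cut:  # j == i contributes 1, as in the untruncated sum
                    mu[i] += 1.0 / (1.0 + (dist / eta) ** upsilon)
    return mu
```

For a structure of bounded density, each cell holds a roughly constant number of atoms, so the work per atom does not grow with $N$ and the total cost is linear. The price is the truncation error from ignoring contributions beyond `r_cut`, which is small for fast-decaying kernels but grows for large scale parameters.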
Figure 7: Top, a visual comparison of atomic experimental B-factors (far left), C-alpha experimental B-factors (left), mFRI predicted B-factors
(middle) and GNM predicted B-factors (right) for the marine snail conotoxin (PDB ID: 1NOT). Bottom, the experimental and predicted B-factor
values plotted per residue.

Clearly, the impact of extra kernels does not affect the essentially linear scaling of fFRI: the lines of fit for fFRI
and multikernel fast FRI (multi fFRI) are both power laws in $N$ with exponents close to one (0.959 for multi fFRI). The
increase in computation time is minor, especially for molecules with smaller numbers of atoms. In contrast, the
line of fit for the GNM has an exponent of about 3.09. The addition of one kernel leads to only
one more fitting parameter, for which the fitting time is negligibly small. Only in extreme cases, with systems far
larger than those currently studied atomistically, might single kernel FRI be preferred. Therefore, it is preferable
to use multikernel based mFRI over single kernel FRI provided there is a significant increase in accuracy and
reliability, as was demonstrated previously. Note that the largest test molecule is an HIV virus capsid, which has
313 236 amino acid residues. It would take the GNM more than 120 years to finish the prediction even if
computer memory were not a limitation. In contrast, the proposed mFRI does the job in about 30 seconds or less on
a single workstation, depending on the processing power.
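
Lines of fit of this kind are obtained by regressing running time against system size on a log-log scale. The sketch below shows one way to do this; the function name is hypothetical and any timings passed to it would have to come from actual measurements, which are not reproduced here.

```python
import numpy as np

def fit_power_law(n, t):
    """Fit t = c * n**p by linear regression of log(t) on log(n); returns (c, p)."""
    n = np.asarray(n, dtype=float)
    t = np.asarray(t, dtype=float)
    # For deg=1, polyfit returns the slope first and the intercept second.
    p, log_c = np.polyfit(np.log(n), np.log(t), deg=1)
    return float(np.exp(log_c)), float(p)

if __name__ == "__main__":
    # Synthetic timings that follow an exact power law, used only to exercise the fit.
    sizes = np.array([1e2, 1e3, 1e4, 1e5])
    times = 2e-3 * sizes ** 1.5
    c, p = fit_power_law(sizes, times)
    print(f"prefactor c = {c:.3g}, exponent p = {p:.3f}")
```

The fitted exponent $p$ is the quantity that distinguishes the methods: a value near one corresponds to linear scaling, while a value near three corresponds to the cost of matrix diagonalization.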

Additional example 

The proposed mFRI works better not only for the B-factor prediction of proteins, but also for small molecules.
One such example is a peptide molecule, a predatory marine snail toxin, shown in Figure 7. This peptide
adopts a cyclical secondary structure which is made up of two connected loops created by two disulfide bonds.
In this structure there happens to be a particular residue at the beginning of the chain which is much more
flexible than the others. This is a difficult case for flexibility prediction, especially coarse-grained prediction, as
there may be side-chain interactions making large contributions to the flexibility of some atoms, and two
disulfide bonds link parts of the chain. Nevertheless, mFRI is able to accurately reproduce the high flexibility of the first
residue. GNM, on the other hand, is unable to recreate the pattern of flexibility at any parametrization. This is
again due to the use of a hard cutoff in the GNM method and the use of a single kernel. The differences in
distances between residues in this structure are too subtle to be captured by a method that treats distance with
a hard cutoff. The kernels used in FRI are sensitive enough to detect the differences in distances between atoms
in this structure, which leads to finding the single stand-out residue.

Figure 8: The consideration of the crystal packing effect in FRI. Top: The crystal packing of 1OSA. The original protein structure is shown in
red. Bottom: The experimental B-factors and predicted ones. The correlation coefficient for FRI is 0.583. Using the packing information,
the correlation coefficient value improves to 0.781.

Crystal packing effect 

Crystal packing is another very important element in theoretical B-factor prediction. The consideration of
the crystal structure usually improves the accuracy of predictive models. We test our FRI model on a widely
used example, i.e., protein 1OSA, which is another calmodulin structure. We consider all the copies in the
crystal within 10 Å of the protein. Figure 8 shows the crystal packing information. We modify
our flexibility index by incorporating all the Cα atoms within the above crystal packing structure. For our FRI
model, we choose the exponential kernel with $\kappa = 1$, $\eta = 4.0$ Å. The correlation coefficient for the original FRI is
0.583. When the crystal packing information is considered, the correlation coefficient value improves to 0.781.
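
A minimal sketch of how neighboring copies might enter the rigidity sum is given below. It is an illustration under stated assumptions, not the procedure used for 1OSA: the function name is hypothetical, the symmetry copies are assumed to be supplied already as coordinate arrays, and a copy is kept when any of its atoms lies within `shell` (10 Å) of any protein atom.

```python
import numpy as np

def packed_rigidity(protein, copies, eta=4.0, kappa=1.0, shell=10.0):
    """Rigidity index of each protein atom, including nearby crystal copies."""
    protein = np.asarray(protein, dtype=float)
    kept = [protein]
    for copy in copies:
        copy = np.asarray(copy, dtype=float)
        d = np.linalg.norm(protein[:, None, :] - copy[None, :, :], axis=-1)
        if d.min() <= shell:  # assumption: one close atom pair is enough to keep a copy
            kept.append(copy)
    environment = np.vstack(kept)
    # Kernel sum over the protein's own atoms plus all atoms of the kept copies.
    d_env = np.linalg.norm(protein[:, None, :] - environment[None, :, :], axis=-1)
    return np.exp(-(d_env / eta) ** kappa).sum(axis=1)
```

The flexibility index is then the reciprocal of this packed rigidity, and the regression against experimental B-factors proceeds as before. Atoms at crystal contacts gain rigidity from the neighboring copies, which lowers their predicted B-factors relative to the isolated-molecule calculation.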

It is interesting to compare the present Fig. 8 with Fig. 2 in our paper. Clearly, the two calmodulin structures have
a similar behavior and are both difficult for FRI. This difficulty can be resolved either by a consideration of crystal
effects or by using the proposed mFRI method. This example indicates that the proposed mFRI method can
account for some of the crystal packing effect.